About the Data

This report explores a dataset of 4,898 white wines with 11 variables quantifying the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
##  [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
##  [7] "total.sulfur.dioxide" "density"              "pH"                  
## [10] "sulphates"            "alcohol"              "quality"
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.878  
##  3rd Qu.:6.000  
##  Max.   :9.000
## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

We dropped the column X, which appears to be the row number. quality is an ordered categorical variable with scores between 0 and 10. It’s interesting to see that our wine experts didn’t give any wine an extreme score of 0 (very bad), 1, 2, or 10 (very excellent). The actual range is from 3 to 9, with the median at 6. The rest of the variables are continuous, which makes sense since they represent the amount of the corresponding substance in the wine, based on physicochemical tests.
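The two clean-up steps described above could be sketched in base R as follows. This is only a minimal sketch: the data frame name `wine` is an assumption, and the toy rows are copied from the first lines of the str() output rather than read from the real file.

```r
# Toy stand-in for the first rows of the dataset (values copied from the
# str() output above); `wine` is an assumed name, not taken from the report.
wine <- data.frame(
  X       = 1:5,
  pH      = c(3.00, 3.30, 3.26, 3.19, 3.19),
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality = c(6L, 6L, 6L, 6L, 6L)
)

# Drop the row-number column.
wine$X <- NULL

# Treat quality as an ordered categorical variable on the 0-10 scale.
# Declaring all 11 levels is an assumption; unused levels stay empty.
wine$quality <- factor(wine$quality, levels = 0:10, ordered = TRUE)

names(wine)
is.ordered(wine$quality)
```

Declaring the full 0 to 10 scale keeps the unused scores visible as empty levels; leaving out `levels` would keep only the scores that actually occur.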

Univariate Plots Section

From this histogram of quality counts, we can see that the distribution is roughly normal (bell-shaped), with the mean (solid line, 5.878) and the median (dashed line, 6) at almost the same value.
Most of the white wines have a quality of 6, followed by 5.
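As a quick check of the mean and median, the ratings can be rebuilt from the frequency table printed earlier. The sketch below uses base R; the helper name `plot_quality_hist` is hypothetical and only illustrates one way such a histogram with the two reference lines could be drawn.

```r
# Rebuild the individual ratings from the frequency table in the report.
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
             "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

q_mean   <- mean(quality)
q_median <- median(quality)

# Hypothetical helper: one bar per score, mean as a solid line and
# median as a dashed line.
plot_quality_hist <- function(quality) {
  hist(quality,
       breaks = seq(min(quality) - 0.5, max(quality) + 0.5, by = 1),
       main = "White wine quality", xlab = "quality")
  abline(v = mean(quality), lty = "solid")
  abline(v = median(quality), lty = "dashed")
}

round(c(mean = q_mean, median = q_median), 3)
```

Calling `plot_quality_hist(quality)` draws the histogram on the current graphics device.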

## 
## FALSE  TRUE 
##  3838  1060
## 
## midiocre  premium 
##     3838     1060

I remember my wine teacher always talking about the Pareto principle (80/20 rule) in the wine industry (yes, I had a wine teacher). Wines of quality 7, 8, and 9 make up 21.6% (1,060 of 4,898) of the white wines rated, which is close to the top 20%. Therefore, we will consider wines of quality 7, 8, or 9 as premium.
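The premium share can be recomputed from the quality counts shown earlier. This is a sketch in base R; the variable names `is_premium` and `type` are chosen for illustration.

```r
# Rebuild the individual ratings from the frequency table in the report.
counts  <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
             "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# A wine counts as premium when its quality is 7, 8 or 9.
is_premium <- quality >= 7
table(is_premium)

# The same split as a labelled two-level factor.
type <- factor(ifelse(is_premium, "premium", "mediocre"),
               levels = c("mediocre", "premium"))
table(type)

# Share of premium wines, in percent.
premium_pct <- round(100 * mean(is_premium), 1)
premium_pct
```

With these counts, 1,060 of the 4,898 wines fall into the premium group, which is about 21.6%.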

## 
## FALSE  TRUE 
##  4613   285

All wines contain sulphur dioxide in various forms, collectively known as sulphites. Even in completely unsulphured wine it is present at concentrations of up to 10 mg/L. Commercially made wines contain from ten to twenty times that amount. (Source: morethanorganic)

Reasons why SO2 is not desirable in wine:

According to EU law, the maximum permitted level of SO2 in white/rosé wine is 210 mg/L. As you can see in the first histogram, 285 wines exceed this limit. We can also observe that all three histograms have a right-skewed distribution. This might be due to the restriction on sulphites: most vineyards would obey the rules and avoid exceeding the limit.
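Counting the wines above the limit is a one-line comparison in R. The sketch below uses a short illustrative vector, not the real column: the first ten values come from the str() output, 440 is the maximum from the summary, and 245 is made up for illustration.

```r
# EU limit for white/rose wine quoted in the text, in mg/L.
eu_limit <- 210

# Illustrative total.sulfur.dioxide readings (NOT the full dataset):
# ten values from the str() output, the summary maximum (440),
# and one made-up value (245).
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129, 245, 440)

# Flag and count the wines above the limit.
over_limit <- total_so2 > eu_limit
table(over_limit)
n_over <- sum(over_limit)
n_over
```

Applied to the full total.sulfur.dioxide column, the same comparison would give the FALSE/TRUE table shown above.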

## [1] 3.188267
## [1] 3.18

In this set of histograms, we explore the acidity of the wines. The first three variables are the amounts of the corresponding acids found in the wines. The fourth variable, pH, indicates the acidity level, where 7 is neutral and the smaller the value, the more acidic the liquid. We observe a right-skewed distribution for the first three and a roughly normal distribution for pH, with the mean and median at about 3.18 (acidic). It makes sense that the pH histogram is not right-skewed like the three above, since the outliers in the acidity histograms would have lower pH values (the tail on the left of the pH histogram).

Some people believe that the sweeter a wine is, the more alcohol it should contain. We cannot tell this just by looking at the histograms here. We will look more into it in the bivariate plots section. Here we can see that both residual sugar and salt (chlorides) have very right-skewed distributions. The amount of salt is really tiny for all white wines, with a maximum of 0.346 g/L. The histogram of alcohol is a bit right-skewed with a peak at around 9 - 9.5%; other than the peak, it is quite uniformly distributed. Most wines have an alcohol level of 8.5 - 12%.
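A rough numeric companion to the histograms is to compare each mean with its median, using the values from the summary output. This is only a heuristic sketch (a mean above the median hints at a longer right tail, but it is not a formal skewness measure), and the column names of `stats` are chosen for illustration.

```r
# Mean and median copied from the summary() output earlier in the report.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides", "alcohol"),
  mean     = c(6.391, 0.04577, 10.51),
  median   = c(5.200, 0.04300, 10.40)
)

# Rough heuristic: a mean above the median hints at a right tail.
stats$mean_above_median <- stats$mean > stats$median

# Relative gap, so variables on different scales can be compared.
stats$relative_gap <- (stats$mean - stats$median) / stats$median

stats
```

On these numbers the gap is largest for residual.sugar and smallest for alcohol, which matches the impression from the histograms.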

Univariate Analysis

What is the structure of your dataset?

The white wine quality dataset consists of 4,898 observations and 12 variables. Each observation is a white variant of the Portuguese “Vinho Verde” wine. Among the 12 variables, there are 11 numeric input variables, which represent the amount of the corresponding substance in the wine based on physicochemical tests. The output variable quality is based on sensory data (the median of at least 3 evaluations made by wine experts), and it is an ordered categorical variable ranging between 0 (very bad) and 10 (very excellent).

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. I am curious to know how the amounts of the other factors affect the ratings from the wine experts.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

From reading the description of the dataset, I think volatile acidity, citric acid, free sulfur dioxide, total sulfur dioxide, and density may support my investigation, because they seem to affect the smell, taste and color. density may contribute to the effect of “wine curtains”, which is also an essential part of wine tasting.

Did you create any new variables from existing variables in the dataset?

Yes, I created a new categorical variable type to indicate whether a wine is premium or mediocre, where premium wines are the ones rated 7 or above in quality and mediocre wines are the rest.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I dropped the column X, which appears to be the row number. I also changed quality to an ordered categorical variable with scores between 0 and 10. Many variables have a right-skewed distribution with outliers at the far end of the tail. However, quality is quite normally distributed, with no extreme values like 0, 1, 2, or 10. I haven’t removed the “outliers” from the dataset because at this point I am not sure whether their extremeness contributes to the feature of interest.

Bivariate Plots Section

To get a broad overview of which variables might be interesting, a scatterplot matrix sounds like a good idea.

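An interesting observation is that outliers occur more in the middle-range qualities (5, 6, 7) than at the extremes. Very few outliers can be observed for 9-quality or 3-quality wines. This is partly expected, since the middle groups contain far more wines (1,457, 2,198 and 880) than quality 3 (20) or quality 9 (5).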

If you look at the boxplot at quality 9 for each factor, notice that the “box” is generally smaller than for other qualities (especially for density and sulfur.dioxide). This suggests that there is a specific set of characteristics required to be rated as a “very excellent” quality Portuguese “Vinho Verde” white wine. At this point, I’m impressed by the wine experts who rated these wines. Just by blind tasting, they can detect the excellent wines with exactly the right amount of each substance.
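The size of each “box” and the number of boxplot outliers can also be computed per quality level. The sketch below uses a small made-up data frame (the alcohol values are invented for illustration and are not taken from the dataset); `count_outliers` is a hypothetical helper built on base R's `boxplot.stats()`.

```r
# Made-up toy data: NOT real wines, only meant to show the calls.
toy <- data.frame(
  quality = c(5, 5, 5, 5, 5, 5, 9, 9, 9, 9),
  alcohol = c(9.0, 9.2, 9.4, 9.5, 9.6, 13.5, 12.4, 12.5, 12.5, 12.7)
)

# Hypothetical helper: number of points a boxplot would draw as outliers
# (beyond 1.5 times the box height from the hinges).
count_outliers <- function(x) length(boxplot.stats(x)$out)

# Box height (interquartile range) and outlier count per quality level.
spread   <- tapply(toy$alcohol, toy$quality, IQR)
outliers <- tapply(toy$alcohol, toy$quality, count_outliers)

spread
outliers
```

On the real data, the same two `tapply()` calls could be repeated for each chemical variable to compare the groups numerically rather than by eye.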

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

From the set of boxplots, we can observe that alcohol seems to be appreciated. With higher alcohol level, the median rating of quality is generally higher.

pH, fixed.acidity and citric.acid show a slight positive correlation as well.

On the other hand, sulfur dioxide, residual sugar, and density are not appreciated: they are negatively correlated with quality.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

We can observe that there is a strong correlation (0.853) between density and residual.sugar, which is what I suspected before.

It’s only natural to see that free.sulfur.dioxide and total.sulfur.dioxide have a correlation of 0.61.

Also as suspected before, sugar and density have a strong correlation of 0.839. All other factors contribute somewhat to density, as the correlations of density with the other factors range from 0.15 to 0.839, except for volatile acidity (corr: 0.0271).

Surprisingly, alcohol and residual.sugar have a negative correlation of -0.427. alcohol and density also have a strong negative correlation of -0.711, which makes sense since density and residual.sugar are highly positively correlated.

From the boxplots on the quality column, we suspect that alcohol, total.sulfur.dioxide, and density have some effects on the ratings of wine quality by the wine experts.

What was the strongest relationship you found?

The strongest relationship I found is between residual.sugar and density. They have a correlation of 0.853. density and alcohol also have a strong negative correlation of -0.78.
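Correlations like the ones quoted above come from `cor()`. The sketch below applies it to the first five rows shown in the str() output, so it only illustrates the call and the direction of the relationships; five rounded rows cannot reproduce the coefficients reported for the full dataset.

```r
# First five rows copied from the str() output (rounded as printed there).
first_rows <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation matrix (the default method of cor()).
cor_matrix <- cor(first_rows)
round(cor_matrix, 2)
```

Even in this tiny sample, density rises with residual.sugar and falls with alcohol, the same directions as in the full correlation plot.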

Multivariate Plots Section

To make this set of plots, outliers (residual.sugar > 30) are removed from the dataset.

We can see that the strong correlation between density and sugar doesn’t change, no matter what the quality is.

Looking at the second plot, we can see that at the same level of sugar, premium wines are less dense than mediocre wines. Mediocre wines also have a bigger range of residual.sugar levels (the outliers we didn’t show are also mediocre wines).

total.sulfur.dioxide and density are not as strongly correlated as sugar and density, but we can observe the same trend: the line of fit for premium is lower than for mediocre.
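The outlier filter and the “lower line of fit” idea can be sketched with a linear model that has one common slope and a separate offset per group. Everything in the toy data frame below is made up for illustration (only the structure mirrors the dataset), and using `lm()` with an additive `type` term is an assumption, not necessarily how the plots were fitted.

```r
# Made-up toy data: four mediocre and four premium wines at the same
# sugar levels, plus one extreme mediocre wine that the filter removes.
toy <- data.frame(
  residual.sugar = c(2, 6, 10, 14, 2, 6, 10, 14, 65.8),
  density = c(0.9929, 0.9943, 0.9961, 0.9975,
              0.9908, 0.9925, 0.9939, 0.9957, 1.039),
  type = factor(c(rep("mediocre", 4), rep("premium", 4), "mediocre"),
                levels = c("mediocre", "premium"))
)

# Remove the outliers the same way as in the text (residual.sugar > 30).
kept <- subset(toy, residual.sugar <= 30)

# Common slope for sugar, separate intercept for each type.
fit <- lm(density ~ residual.sugar + type, data = kept)
coef(fit)
```

A negative `typepremium` coefficient means that, at the same sugar level, the fitted line for premium wines sits below the one for mediocre wines.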

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

sugar and density seem to strengthen each other in terms of looking at quality. At the same sugar level, premium wines tend to be less dense than mediocre wines. Wines with an extremely high sugar level have a lower chance of being rated as excellent.

Were there any interesting or surprising interactions between features?

Wines at quality levels 5, 6, and 7 are the ones that show extreme levels in features like residual.sugar and sulfur.dioxide. This is surprising, as they are rated not as bad wines (the lowest quality levels) but as OK wines.


Final Plots and Summary

Plot One

Description One

This is a histogram of quality counts of the wines. The dashed line is the median and the solid line is the mean. We can see that it’s a roughly normal distribution with the mean (5.878) and median (6) close together.

Plot Two

Description Two

This is a correlation plot of all the numeric variables. We can see there is a strong positive correlation between density and residual.sugar. density and alcohol have a strong negative correlation. alcohol tends to have a negative correlation with the rest of the variables. There is a self-explanatory positive correlation between free.sulfur.dioxide and total.sulfur.dioxide. total.sulfur.dioxide and density also have some interesting correlation.

Plot Three

Description Three

From this plot, we can see that at the same level of sugar, premium wines are less dense than mediocre wines. Mediocre wines also have a bigger range of residual.sugar levels (the outliers we didn’t show here are also mediocre wines).


Reflection

At the beginning it was hard to understand what each numeric variable means and how it could affect the quality of wine. After doing some research and reading the documentation of the dataset more carefully, it became clearer how I could explore this dataset. Another struggle is that the differences in the values of the variables are really subtle: you can see from the scatterplots that the points all cluster together, and it’s hard to visualize when you just put quality as color in the same scatterplot. Maybe some transformation of the data could be used in the future to make it possible to visually separate the clusters.